🎧 Beatbot

GitHub: https://github.com/alexcrist/beatbot

What if you could beatbox* into an app that could process the audio, determine which sounds are which, and produce a matching audio track of real drum noises?

From Siri to Google Assistant, a handful of applications have explored and mastered decoding human speech from audio. The push for these kinds of excellent speech processing algorithms has resulted in a wealth of knowledge on the topic, from blog posts to academic papers.

Beatbot uses these documented techniques along with a few novel strategies to analyze audio and replace beatbox sounds with drum sounds.

Here are a few examples of what it can do.

* Beatboxing is vocal percussion (i.e.: "pft tss kah tss pft tss kah")

📼 Examples

Pretty neat! Now let's see how Beatbot works behind the scenes.


🔮 How Beatbot Works

Beatbot operates in three steps.

  1. First, it locates all of the beatbox sounds in the input audio
  2. Next, it classifies which sounds are which
  3. And finally, it replaces the beatbox sounds with similar sounding drum sounds

🔬 Part 1: Beat location

The first step in Beatbot is to locate the start and end of each beatbox sound.

Let's start out by taking another look at the input audio from Example 1.

Visually, we can already begin to see where the beats are located, but the raw waveform isn't good enough. Certain loud noises don't register as large amplitude spikes while certain quiet noises do.

To get a cleaner representation of the audio's loudness, we'll start by taking a look at the frequencies of the audio over time.

We'll do this by applying the "Fourier transform" to small, overlapping windows of our audio wave.

In this visualization, known as a spectrogram, we can clearly see where each beat is located. Yellower colors indicate loudness while bluer colors indicate quietness. Positions near the top represent high frequencies, while lower positions represent low frequencies.

To determine the loudness of our audio at any point, we just need to add up all the yellow energy in a given time column.

Check out the code for more specifics on how we calculated this spectrogram.

Summing our spectrogram by column gives us the volume of the beatbox track over time.
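As a rough sketch of this step, here's how a spectrogram and its column-sum volume curve could be computed with SciPy. The two-burst synthetic signal, sample rate, and window sizes below are illustrative assumptions, not Beatbot's actual parameters:

```python
import numpy as np
from scipy.signal import spectrogram

# Hypothetical input: one second of silence with two short noise bursts
# standing in for beatbox sounds.
sr = 8000                                    # assumed sample rate in Hz
audio = np.zeros(sr)
rng = np.random.default_rng(0)
audio[1000:1400] = rng.standard_normal(400)  # "beat" 1
audio[5000:5400] = rng.standard_normal(400)  # "beat" 2

# Fourier transform over small, overlapping windows of the audio wave.
freqs, times, spec = spectrogram(audio, fs=sr, nperseg=256, noverlap=192)

# Summing each time column of the spectrogram gives loudness over time.
volume = spec.sum(axis=0)

print(times[volume.argmax()])  # the loudest column lies inside a burst
```

With real audio you'd see the same effect: the column sums spike wherever the spectrogram has a bright vertical band.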

We now can see clear peaks at each beatbox sound.

We'll now use a simple peak finding algorithm to determine how many peaks exist at each peak prominence value from zero to the largest peak prominence.

This approach will give us the flexibility to use Beatbot on tracks with varying volume.

The above graph shows how the number of peaks located changes as the required peak prominence is set at different values between zero and our maximum.

In the middle of this graph, an unusually long flat section exists where, as we change the required peak prominence value, the resulting number of peaks located does not change.

This flat zone corresponds to the most significant peaks in the signal, and its peak count is the number of peaks we'll look for in our volume signal.
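A minimal sketch of this prominence sweep, using `scipy.signal.find_peaks` on a synthetic volume curve with three strong peaks plus low-level ripple (the curve shape and threshold resolution are illustrative assumptions):

```python
import numpy as np
from scipy.signal import find_peaks, peak_prominences

# Hypothetical volume curve: three strong peaks over a noisy ripple floor.
x = np.linspace(0, 1, 500)
volume = (np.exp(-((x - 0.2) ** 2) / 1e-3)
          + np.exp(-((x - 0.5) ** 2) / 1e-3)
          + np.exp(-((x - 0.8) ** 2) / 1e-3)
          + 0.05 * np.sin(60 * np.pi * x) ** 2)

all_peaks, _ = find_peaks(volume)
max_prom = peak_prominences(volume, all_peaks)[0].max()

# Count surviving peaks at each required prominence from 0 to the maximum.
thresholds = np.linspace(0, max_prom, 200)
counts = np.array([len(find_peaks(volume, prominence=th)[0])
                   for th in thresholds])

# The count only decreases as the threshold grows, so each value occupies
# one contiguous run; the most frequent value is the longest flat section.
values, run_lengths = np.unique(counts, return_counts=True)
n_beats = int(values[np.argmax(run_lengths)])
print(n_beats)
```

The low-prominence ripple peaks drop out almost immediately, while the three genuine peaks survive across a wide band of thresholds, producing the long flat section.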

Having determined the number of peaks in the volume signal, we obtain the start and end locations of each peak by looking for the points to the left and right of each peak's tip which lie at 70% of that peak's total prominence.

The value '70%' was chosen through trial and error.
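Conveniently, `scipy.signal.peak_widths` with `rel_height=0.7` measures each peak at exactly this level (70% of its prominence below the tip) and returns the interpolated left and right crossing positions. A small sketch on a synthetic two-peak curve:

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

# Hypothetical volume curve with two clear peaks.
x = np.linspace(0, 1, 400)
volume = (np.exp(-((x - 0.3) ** 2) / 2e-3)
          + np.exp(-((x - 0.7) ** 2) / 2e-3))

peaks, _ = find_peaks(volume, prominence=0.5)

# rel_height=0.7 evaluates each peak 70% of its prominence below the tip;
# left_ips/right_ips are the crossing positions in (fractional) samples.
widths, heights, lefts, rights = peak_widths(volume, peaks, rel_height=0.7)
starts, ends = lefts.astype(int), rights.astype(int)
print(list(zip(starts, ends)))
```

Each `(start, end)` pair brackets one peak's tip, giving the beat boundaries.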

And finally, here are our beats' starts and ends overlaid onto the original audio signal.


🌌 Part II: Beat classification

Now that we know the locations of all of the beats in the track, we need to determine which beats are which.

To start off, let's look at our beats.

Listening to the audio again, our expected beat pattern should be:

  • 0 1 1 1
  • 2 1 1 1
  • 1 1 0 1
  • 2 1 1 1
  • 0 1 1 1
  • 2 1 1 1
  • 1 1 0 1
  • 2 1 1 1
  • 3

Where:

  • 0 = "pft"
  • 1 = "tss"
  • 2 = "khh"
  • 3 = Unintentional knock

Let's visualize this.

With this goal in mind, our first task in beat classification is to featurize our beats in a way that will let us compare them to one another. A proven audio featurization popular in speech processing is "Mel-frequency cepstral coefficients" (MFCCs).

MFCCs are feature vectors that are created in three steps:

  1. The input audio is windowed and transformed into its frequency components via the Fourier transform
  2. Mel-coefficients are extracted from each set of frequencies in time (the Mel scale is essentially the human hearing scale)
  3. These values are compressed, usually using the discrete cosine transform
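The three steps above can be sketched from scratch with NumPy and SciPy. The filterbank size, coefficient count, and window length below are common textbook defaults, not necessarily what Beatbot uses:

```python
import numpy as np
from scipy.signal import stft
from scipy.fft import dct

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc(audio, sr, n_filters=26, n_coeffs=13, nperseg=256):
    # Step 1: window the audio and take its frequency content.
    freqs, _, spec = stft(audio, fs=sr, nperseg=nperseg)
    power = np.abs(spec) ** 2

    # Step 2: pool the power spectrum through triangular filters spaced
    # evenly on the Mel scale (roughly the human hearing scale).
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    fbank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        fbank[i] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                      (hi - freqs) / (hi - mid)), 0, None)
    mel_energy = np.log(fbank @ power + 1e-10)

    # Step 3: compress with the discrete cosine transform, keeping only
    # the first few coefficients.
    return dct(mel_energy, axis=0, norm='ortho')[:n_coeffs]

sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
features = mfcc(tone, sr)
print(features.shape)  # (n_coeffs, n_frames)
```

In practice a library like librosa or python_speech_features does this for you, but the structure is the same: STFT, Mel filterbank, log, DCT.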

Now let's extract some MFCCs from our beats.

To us, these all look pretty similar. That's fine, though, because the computer can tell them apart just fine.

Now we'll compare each feature set to every other feature set using Dynamic Time Warping (DTW). DTW is a strategy that allows us to compare feature sets of different sizes.

The result of each comparison is a distance value that represents how similar the two feature sets are. We'll store these values in a distance matrix.
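A straightforward (unoptimized) DTW and the pairwise distance matrix might look like this, with random arrays standing in for the beats' MFCC feature sets:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Hypothetical beats: MFCC-like sequences of *different* lengths.
rng = np.random.default_rng(1)
beats = [rng.standard_normal((8, 13)),
         rng.standard_normal((10, 13)),
         rng.standard_normal((9, 13))]

# Pairwise comparisons fill a symmetric distance matrix.
n = len(beats)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw_distance(beats[i], beats[j])
print(dist.round(1))
```

Note that DTW handles the unequal sequence lengths (8, 10, 9 frames) without any padding or truncation, which is exactly why it suits beats of varying duration.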

In our distance matrix here, a dark pixel at coordinate (x, y) indicates that beat x and beat y are similar. A yellow pixel indicates dissimilarity.

With this distance matrix, we can now use a clustering algorithm to group together similar beats. Let's try using hierarchical clustering.

The above 'dendrogram' is the result of our hierarchical clustering. It represents multiple clustering options; to get any single clustering, simply make a horizontal cut across the chart and observe which nodes are connected.

The question for us is: where should we make this horizontal cut? How many clusters should we choose?

One method of determining this is by looking at how the number of clusters changes as we change the position of our horizontal cut.

The purple line in the above graph shows how the number of clusters changes as we move the position of the cut.

A popular method of determining a 'good' number of clusters is to look for the steepest slope change in this purple line. We can do this by locating the maximum value of the purple line's second difference (shown as the orange line).

This is known as a 'knee point'.
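The whole cut-selection step can be sketched with `scipy.cluster.hierarchy`, using a hand-built toy distance matrix with two obvious groups. The knee-point rule here is the second-difference maximum just described; the matrix values are made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical distance matrix for 6 beats: two tight groups
# ({0,1,2} and {3,4,5}) that are far apart from each other.
dist = np.full((6, 6), 10.0)
np.fill_diagonal(dist, 0.0)
for group in [(0, 1, 2), (3, 4, 5)]:
    for i in group:
        for j in group:
            if i != j:
                dist[i, j] = 1.0

# Hierarchical clustering on the condensed form of the matrix.
Z = linkage(squareform(dist), method='average')

# Count clusters as the horizontal cut moves from 0 to above the tree top.
heights = np.linspace(0, Z[:, 2].max() + 1, 100)
n_clusters = np.array([len(set(fcluster(Z, h, criterion='distance')))
                       for h in heights])

# Knee point: the largest second difference of the cluster-count curve.
knee = np.argmax(np.diff(n_clusters, 2)) + 1
labels = fcluster(Z, heights[knee], criterion='distance')
print(labels)
```

On this toy matrix the knee lands at two clusters, recovering the two groups exactly; on real beat distances the curve is noisier but the same rule applies.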

Having chosen a cluster quantity, all that's left to do is determine which beats belong to which clusters. That's shown above.

And it worked great! Our only misclassification is beat #32, the unintentional knock noise.


🥘 Part III: Beat replacement

We now know both where the beats are and what they represent. All that remains is to build a new track with similar sounding drums.

I've curated fifty drum kit sounds to choose from. For each beatbox sound, we'll run through these fifty drum sounds and determine which is the most similar using MFCCs and DTW.
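This matching step boils down to a nearest-neighbor search under DTW distance. In the sketch below the drum names and feature arrays are made-up stand-ins (the 'hihat' entry is deliberately constructed to be the near-identical match):

```python
import numpy as np

def dtw_distance(a, b):
    # Same simple DTW as in Part II.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

# Hypothetical MFCC feature sets: one beatbox sound, and a small stand-in
# for the fifty-sample drum library.
rng = np.random.default_rng(2)
beat_features = rng.standard_normal((6, 13))
drum_library = {
    'kick':  rng.standard_normal((5, 13)),
    'snare': rng.standard_normal((7, 13)),
    'hihat': beat_features + 0.01 * rng.standard_normal((6, 13)),
}

# Pick the drum sample whose features are closest under DTW.
best = min(drum_library,
           key=lambda name: dtw_distance(beat_features, drum_library[name]))
print(best)
```

Running this per cluster (rather than per beat) means each of the track's distinct beatbox sounds maps to one consistent drum sample.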

Great, we have our three drum sounds. Let's build the final output.

Final product!
And the original again:

🎆 Conclusion

This project has been fun and the Beatbot algorithm works decently well!

A fine-tuned version could potentially be useful as a tool in music production for amateurs or professionals looking to make quick beat mock-ups.

To achieve a Beatbot algorithm that works even better, we'd probably want to use more recent, cutting-edge speech processing techniques such as convolutional neural networks (CNNs).

Given an enormous unlabeled set of beatbox audio data (like 10,000+ hours), we could create a better featurization model than MFCCs by using a strategy like wav2vec. Wav2vec is a project that performs unsupervised training on a CNN with massive amounts of audio data, creating a cutting-edge audio featurization model.